{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 文本解析\n",
    "\n",
    "文本解析器已经很成熟了。它们可以读取文档，并从文件中提取文本。常见的例子包括 PyPDF、PyMUPDF 和 PDFMiner以及很多其他。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## %pip install -qU langchain_community pypdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import PyPDFLoader \n",
    "\n",
    "file_path = \"./Nvidia_2025.pdf\"\n",
    "loader = PyPDFLoader(file_path)\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(docs[1].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## OCR 文本解析\n",
    "\n",
    "如果选择像Pytesseract这样的OCR工具，不仅能更有效地捕获文本，还能保留文档的结构。这种方法比基础的文本解析器能更好地保留原始格式和上下文。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from PIL import Image\n",
    "import pytesseract\n",
    "from pdf2image import convert_from_path\n",
    "import os"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将文件路径作为参数传递\n",
    "pages = convert_from_path(file_path)\n",
    "\n",
    "all_text= \"\"\n",
    "# 确保 i 在页面范围内，避免索引越界\n",
    "for i in range(len(pages)):\n",
    "    filename = f\"page{i}.jpg\"\n",
    "    pages[i].save(filename, 'JPEG')\n",
    "    # 输出文本的文件\n",
    "    outfile = f\"page{i}_text.txt\"\n",
    "    # 使用 with 语句打开文件，确保安全关闭\n",
    "    with open(outfile, \"a\") as f:\n",
    "        text = str(pytesseract.image_to_string(Image.open(filename),lang=\"chi_sim\"))\n",
    "        # 写入文本\n",
    "        f.write(text)\n",
    "        all_text += text + \"\\n\"  # 每页的文本用换行符分隔\n",
    "else:\n",
    "    print(f\"PDF 只有 {len(pages)} 页，无法访问第 {i+1} 页\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 智能文档解析（IDP）\n",
    "一种集成多种技术的文档处理方法，旨在高效地将非结构化文档转换为结构化数据。它可以帮助自动化提取文本和相关信息，并通过诸如OCR、LLM和Markdown格式化等技术来增强解析效果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import getpass\n",
    "\n",
    "os.environ[\"LLAMA_CLOUD_API_KEY\"] = getpass.getpass()\n",
    "from llama_parse import LlamaParse\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n",
    "\n",
    "documents = LlamaParse(result_type=\"markdown\").load_data(file_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(documents[0].get_content())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 构建RAG系统"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.vectorstores import Chroma\n",
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "import os\n",
    "from langchain import PromptTemplate\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass()\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500)\n",
    "\n",
    "template = \"\"\"You are an assistant for question-answering tasks. \n",
    "Use the following pieces of retrieved context to answer the question. \n",
    "If you don't know the answer, just say that you don't know. \n",
    "\n",
    "Question: {question} \n",
    "\n",
    "Context: {context} \n",
    "\n",
    "Answer:\n",
    "\"\"\"\n",
    "\n",
    "prompt = PromptTemplate(\n",
    "    template=template, \n",
    "    input_variables=[\"context\",\"question\"]\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 文本解析 RAG 表现"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = text_splitter.split_documents(docs)\n",
    "\n",
    "vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings(), collection_name= \"pyparse_db\")\n",
    "base_retriever = vectorstore.as_retriever(search_kwargs={\"k\" : 3})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.schema.runnable import RunnablePassthrough\n",
    "from langchain.schema.output_parser import StrOutputParser\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "\n",
    "llm = ChatOpenAI(model_name=\"gpt-4o\", temperature=0)\n",
    "\n",
    "rag_chain = (\n",
    "    {\"context\": base_retriever,  \"question\": RunnablePassthrough()} \n",
    "    | prompt \n",
    "    | llm\n",
    "    | StrOutputParser() \n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rag_chain.invoke(\"2024财年第一季度的营业收入是多少？\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## OCR RAG 表现"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_ocr = text_splitter.split_text(all_text)\n",
    "\n",
    "vectorstore_ocr = Chroma.from_texts(text_ocr, OpenAIEmbeddings(), collection_name= \"pyparse_ocr\")\n",
    "base_retriever_ocr = vectorstore_ocr.as_retriever(search_kwargs={\"k\" : 3})\n",
    "\n",
    "rag_chain = (\n",
    "    {\"context\": base_retriever_ocr,  \"question\": RunnablePassthrough()} \n",
    "    | prompt \n",
    "    | llm\n",
    "    | StrOutputParser() \n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rag_chain.invoke(\"2024财年第一季度的营业收入是多少？\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IDP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs_idp = text_splitter.split_text(documents[0].get_content())\n",
    "vectorstore_idp= Chroma.from_texts(docs_idp, OpenAIEmbeddings(),collection_name=\"pyparse_idp\")\n",
    "base_retriever_idp = vectorstore_idp.as_retriever(search_kwargs={\"k\" : 3})\n",
    "\n",
    "rag_chain = (\n",
    "    {\"context\": base_retriever_idp,  \"question\": RunnablePassthrough()} \n",
    "    | prompt \n",
    "    | llm\n",
    "    | StrOutputParser() \n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rag_chain.invoke(\"2024财年第一季度的营业收入是多少？\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "undefined.undefined.undefined"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
